Introduction
2024-08-28
Dyr og data is your first foray into learning key data science skills
This is a course in applied data science
We’re not trying to turn you into computer scientists or statisticians
We do want you to use data science skills during your education & when you graduate
Introduction
What is Data Science?
Data types, storage, security and ethics
Data handling and wrangling
Data Visualization
Descriptive and exploratory data analysis
Statistical thinking and ‘data literacy’
Dynamic reporting in document and presentation format
Databases
At the end of the course
This course is different
You will mostly be working in groups during class time
Outside of class you will:
No free version 😭
We will use most of this book by the time you graduate
Physical copies in the bookstore
We won’t use it for a few weeks yet
You will be assessed as passing the course, or not
Student gives a 5 minute presentation on a randomly selected portfolio project (48 hrs prep)
Followed by 15 minutes of questions on the course syllabus
We’ll introduce these to you later in the course
posit.cloud
Sign up for a free account; invites were sent out this morning
By direct email:
Expect a response within 48 hours (2 working days)
During the week, responses usually arrive within 24 hours
If you send an email after 4pm on Friday, don’t expect a response until Monday at the earliest
Email to arrange a meeting as needed
Mona and I have randomly assigned you to groups
From Friday please sit with your group
Group 1
Group 2
Group 3
Group 4
Group 5
Laptop!
Textbook
At the end of this topic you should be able to
Articulate what data science is
Understand at a high level the steps involved in doing data science
Describe the roles and skills of a data scientist
Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets
Kelleher & Tierney, p. 1
Related fields
Data science is broader, borrowing from these fields and many others
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets
Kelleher & Tierney, p. 1
Data science outputs are only useful if we or others can make use of them
Does data science provide us with information that wasn’t obvious?
Can we do something useful with the new information?
Data Architecture
Data Acquisition
Data Analysis
Data Archiving
Provide input on how data need to be routed and organized to support the
analysis,
visualization, and
presentation of data
How should the data be collected and represented prior to analysis?
Important tasks that need to happen before data can be profitably analyzed are
representing data
transforming data
grouping
linking
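The four tasks above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library; the records, the field names (`animal_id`, `weight_g`, `species`) and the values are hypothetical and are not course data.

```python
from collections import defaultdict

# Representing: each weighing is stored as a dictionary (hypothetical records).
weighings = [
    {"animal_id": "A1", "weight_g": 4200},
    {"animal_id": "A2", "weight_g": 3900},
    {"animal_id": "A1", "weight_g": 4400},
]
# A second hypothetical table, keyed by the same identifier.
animals = {"A1": {"species": "cat"}, "A2": {"species": "rabbit"}}

# Transforming: convert grams to kilograms.
for row in weighings:
    row["weight_kg"] = row["weight_g"] / 1000

# Linking: attach the species from the animals table via animal_id.
linked = [{**row, **animals[row["animal_id"]]} for row in weighings]

# Grouping: collect the weights recorded for each species.
weights_by_species = defaultdict(list)
for row in linked:
    weights_by_species[row["species"]].append(row["weight_kg"])

print(dict(weights_by_species))
```

The shared `animal_id` key is what makes linking possible, which is one reason to decide how data are represented before collecting them.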
How can we summarize data?
Use samples of data to make inferences about the larger context or population
Visualize data and analysis outputs in graphs, tables, animations, dashboards
Communicate the results of the analysis
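As a rough illustration of summarizing data, the sketch below computes a handful of descriptive statistics with Python's standard `statistics` module. The sample of body weights is made up for this example.

```python
import statistics

# Hypothetical sample: body weights (kg) of eight animals.
weights_kg = [4.2, 3.9, 4.4, 5.1, 3.7, 4.8, 4.0, 4.5]

summary = {
    "n": len(weights_kg),
    "mean": statistics.mean(weights_kg),
    "median": statistics.median(weights_kg),
    "sd": statistics.stdev(weights_kg),  # sample standard deviation
    "min": min(weights_kg),
    "max": max(weights_kg),
}

# Print each summary statistic to two decimal places.
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A table like this is often the first thing to communicate, before any graph or model.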
How should we preserve data that has been collected?
What forms of the data need to be preserved?
Difficult to anticipate future uses of data
Important to learn the application domain
Need to know enough to
understand the problem
understand why the problem is important
understand how data science might address the problem
If data are important enough to collect, they’re important enough to affect people’s lives
Need to understand ethical issues
privacy, personal data-protection
biases in data & models
limitations of the data
prevent misuse
Working with data, files, & databases is an essential skill
understand how data are stored
transform data
generate metadata
how to link data
query databases with & SQL
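Linking data and querying a database can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table names, column names and rows are hypothetical; the point is only to show a SQL join (linking) combined with a grouped summary.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE animals (animal_id TEXT PRIMARY KEY, species TEXT);
    CREATE TABLE weighings (animal_id TEXT, weight_kg REAL);
    INSERT INTO animals VALUES ('A1', 'cat'), ('A2', 'rabbit');
    INSERT INTO weighings VALUES ('A1', 4.2), ('A1', 4.4), ('A2', 3.9);
""")

# Link the two tables on animal_id, then summarize per species.
rows = conn.execute("""
    SELECT a.species, COUNT(*) AS n, AVG(w.weight_kg) AS mean_kg
    FROM weighings AS w
    JOIN animals AS a ON a.animal_id = w.animal_id
    GROUP BY a.species
    ORDER BY a.species
""").fetchall()

for species, n, mean_kg in rows:
    print(species, n, round(mean_kg, 2))

conn.close()
```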
Computer science & HPC provide algorithms & data structures to tackle increasingly large amounts of data
algorithms
distributed computing & MapReduce
use computer clusters to parallelise operations
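The map and reduce idea can be sketched on a single machine. In this Python illustration, a thread pool stands in for the nodes of a cluster, and the helper names (`chunked`, `map_chunk`, `combine`) and the data are hypothetical. It is a toy version of the pattern, not a description of any particular cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_chunk(chunk):
    """Map step: reduce one chunk to a partial (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(left, right):
    """Reduce step: merge two partial (sum, count) pairs."""
    return (left[0] + right[0], left[1] + right[1])


# Hypothetical data standing in for a much larger data set.
values = list(range(1, 1001))

# Each chunk is processed independently, so the work can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_chunk, chunked(values, 100)))

# Combine the partial results into the overall mean.
total, count = reduce(combine, partials)
mean = total / count
print(total, count, mean)
```

Because each chunk only produces a small partial result, the same structure still works when the chunks live on different machines.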
Know how to present data in forms that are suitable and that aid decision making
theory behind perception
encoding data graphically
appropriate plots
grammar of graphics
create infographics
dashboards
Statistics is the field of science concerned with making inferences from samples of data drawn from larger populations
exploratory data analysis
summarize data
use statistical methods to make inferences
communicate results of statistical models
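A small sketch of inference from a sample: estimating a population mean together with an approximate 95% confidence interval. The sample values are hypothetical, and the interval uses a normal approximation for simplicity. With a sample this small, a t-distribution would normally be more appropriate.

```python
import math
import statistics

# Hypothetical sample: body weights (kg) of eight animals.
sample = [4.2, 3.9, 4.4, 5.1, 3.7, 4.8, 4.0, 4.5]

n = len(sample)
mean = statistics.mean(sample)

# Standard error of the mean: sample standard deviation / sqrt(n).
se = statistics.stdev(sample) / math.sqrt(n)

# Critical value from the standard normal distribution (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

lower, upper = mean - z * se, mean + z * se
print(f"mean = {mean:.2f} kg, approx. 95% CI = ({lower:.2f}, {upper:.2f})")
```

Reporting the interval alongside the estimate is one simple way to communicate how uncertain the result is.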
An offshoot from statistics (statistical learning) & computer science
underlying principles of machine learning methods
model assessment
variable importance
neural networks
tree-based models
prediction vs explanation
Communicating with end users, data generators, etc. is an essential component of any applied science
Need to translate the technical language of animal science, computer science, statistics, and machine learning into the language used in specific domains
communicate with specialists
communicate with end users
aid decision making
communicate uncertainty
Read from r4ds
Watch a short video about posit.cloud
Watch a short video about running R code